OCR Correction and Query Expansion for Retrieval on OCR Data -- CLARIT TREC-5 Confusion Track Report

Authors

  • Xiang Tong
  • ChengXiang Zhai
  • Natasa Milic-Frayling
  • David A. Evans
Abstract

Similar resources

Report on the TREC-5 Confusion Track

For TREC-5, retrieval from corrupted data was studied through retrieval of single target documents from a corpus which was corrupted by producing page images, corrupting the bit maps, and applying OCR techniques to the results. In general, methods which attempted a probabilistic estimation of the original clean text fare better than methods which simply accept corrupted versions of the query text.

Revisiting Known-Item Retrieval in Degraded Document Collections

Optical character recognition software converts an image of text to a text document but typically degrades the document’s contents. Correcting such degradation to enable the document set to be queried effectively is the focus of this work. The described approach uses a fusion of substring generation rules and context-aware analysis to correct these errors. Evaluation was facilitated by two publ...

The TREC-6 Spoken Document Retrieval Track

The Text REtrieval Conference (TREC) workshops provide a forum for different groups to compare retrieval systems on common retrieval tasks. The 1997 TREC workshop will feature a Spoken Document Retrieval task for the first time. This paper motivates the task and describes the measures to be used to evaluate the effectiveness of the retrieval methodologies. 1. The Text REtrieval Conference The Text ...

A Content-based Probabilistic Correction Model for OCR Document Retrieval

The difficulty with information retrieval for OCR documents lies in the fact that OCR documents contain a significant number of erroneous words, and unfortunately most information retrieval techniques rely heavily on word matching between documents and queries. In this paper, we propose a general content-based correction model that can work on top of an existing OCR correction tool to “boost...

RMIT University at TREC 2008: Legal Track

This paper reports on the participation of RMIT University in the 2008 TREC Legal Track Ad Hoc task. OCR errors can corrupt the document view formed by an information retrieval system, and substantially hinder the successful retrieval of relevant documents for user queries. In previous research, the presence of errors in OCR text was observed to lead to unstable and unpredictable retrieval effe...

Publication date: 1996